As football enthusiasts, we are keen on analyzing games in great depth. Fortunately, the growing availability of open football data provides many opportunities to do so. This was one of the main motivations for our project. We set out to explore how shots have evolved over time, and whether we could build our own expected goals (xG) model — a tool widely used in football analytics to estimate the probability of a shot resulting in a goal.
To do this, we use open-access data provided by StatsBomb. For each event in a match (like a shot, pass, dribble, save, or duel), StatsBomb provides detailed event-level information on what occurred on the pitch. We have this data across 13 competitions for a wide timeframe, from 1958 to 2024.
Ultimately, our goal is to test the hypothesis that closer shots have a higher chance of resulting in goals, and to see whether this insight is reflected in trends over time. In other words: are teams becoming more conservative, favoring high-probability chances over long-range efforts? In the end, we might see this being reflected in our xG model. The code for this project can also be found on GitHub.
For this project, we will rely on the package StatsBombR. This package gives us access to detailed, event-level football data — including passes, shots, dribbles, and more — from professional matches around the world.
In general, the package is used in the following way:
In general, we can use the following code to obtain the data via the package’s API. We use the API to load the data to power the heatmaps and shot charts we see in Section II, which only use FIFA World Cup data. However, the dataset that we use to train our model is very big and covers more competitions, so we decided to attach it as a CSV. Nevertheless, the code we used to obtain the model training data is commented out below:
# #Pulling StatsBomb Free Data Into R
# library(tidyverse)
# library(StatsBombR)
# Comps <- FreeCompetitions()
# comps_shots <- Comps %>%
# filter(
# competition_gender == 'male',
# !competition_name %in% c('FIFA U20 World Cup', 'Indian Super league', 'Major League Soccer', 'North American League')
# )
#
# Matches <- FreeMatches(Comps)
#
# Matches_Shots <- Matches
#
# Matches_Passes <- Matches %>%
# filter(year(match_date) >= 2000)
#
# StatsBombData_Shots <- free_allevents(MatchesDF = Matches_Shots, Parallel = T)
# StatsBombData_Passes <- free_allevents(MatchesDF = Matches_Passes, Parallel = T)
#
# StatsBombData_Shots = allclean(StatsBombData_Shots)
# StatsBombData_Passes = allclean(StatsBombData_Passes)
#
# shots <- StatsBombData_Shots %>%
# filter(type.name == "Shot", !is.na(location)) %>%
# unnest_wider(location, names_sep = "_") %>%
# rename(x = location_1, y = location_2)
#
# passes <- StatsBombData_Passes %>%
# filter(type.name == "Pass", !is.na(location)) %>%
# unnest_wider(location, names_sep = "_") %>%
# rename(x = location_1, y = location_2)
#
# shots <- shots %>%
# left_join(
# Matches %>%
# select(match_id, match_date),
# by = "match_id"
# ) %>%
# left_join(
# Comps %>%
# select(competition_id, season_id, competition_name, season_name),
# by = c("competition_id", "season_id")
# ) %>%
# mutate(match_date = as.Date(match_date))
#
# shots_clean <- shots %>%
# select(
# -carry.end_location,
# -goalkeeper.end_location,
# -tactics.lineup,
# -related_events,
# -shot.freeze_frame,
# -pass.end_location
# ) %>%
# unnest_wider(shot.end_location, names_sep = "_") %>%
# rename(
# shot.end_x = shot.end_location_1,
# shot.end_y = shot.end_location_2
# )
#
# passes <- passes %>%
# left_join(
# Matches %>%
# select(match_id, match_date),
# by = "match_id"
# ) %>%
# left_join(
# Comps %>%
# select(competition_id, season_id, competition_name, season_name),
# by = c("competition_id", "season_id")
# ) %>%
# mutate(match_date = as.Date(match_date))
#
# passes_clean <- passes %>%
# select(
# -carry.end_location,
# -goalkeeper.end_location,
# -tactics.lineup,
# -related_events,
# -shot.end_location,
# -shot.freeze_frame
# ) %>%
# unnest_wider(pass.end_location, names_sep = "_") %>%
# rename(
# pass.end_x = pass.end_location_1,
# pass.end_y = pass.end_location_2
# )
#
# write_csv(shots_clean, "shots.csv")
# write_csv(passes_clean, "passes.csv")
Another important point is that we filtered our shots data to just include major men’s competitions. This leaves us with about 70,000 shots to train our model with.This was the only possibility to get data as it was too big otherwise.
On the other hand, the data for passes only includes 2 specific seasons for computational efficiency - and this is already includes about 200,000 observations.
shots_df <- read_csv(unz("shots_new.csv.zip", "shots_new.csv"))
## Rows: 70553 Columns: 121
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): id, play_pattern.name, player.name, position.name, possession_tea...
## dbl (15): duration, index, location.x, location.y, match_id, minute, period...
## lgl (90): 50_50, bad_behaviour_card, ball_receipt_outcome, ball_recovery_re...
## date (1): match_date
## time (1): timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
passes_df <- read_csv(unz("passes.csv.zip", "passes.csv"))
## Rows: 198824 Columns: 180
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): id, type.name, possession_team.name, play_pattern.name, team.nam...
## dbl (36): index, period, minute, second, possession, duration, x, y, type....
## lgl (126): under_pressure, counterpress, out, off_camera, tactics.formation...
## date (1): match_date
## time (1): timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before addressing our research question, we shall do some some preliminary data exploration to understand and get an overview of the data.
names(shots_df)
## [1] "50_50" "bad_behaviour_card"
## [3] "ball_receipt_outcome" "ball_recovery_recovery_failure"
## [5] "block_deflection" "block_save_block"
## [7] "clearance_aerial_won" "clearance_body_part"
## [9] "clearance_head" "clearance_left_foot"
## [11] "clearance_right_foot" "counterpress"
## [13] "dribble_nutmeg" "dribble_outcome"
## [15] "duel_outcome" "duel_type"
## [17] "duration" "foul_committed_advantage"
## [19] "foul_committed_card" "foul_committed_penalty"
## [21] "foul_won_advantage" "foul_won_defensive"
## [23] "foul_won_penalty" "goalkeeper_body_part"
## [25] "goalkeeper_outcome" "goalkeeper_position"
## [27] "goalkeeper_technique" "goalkeeper_type"
## [29] "id" "index"
## [31] "injury_stoppage_in_chain" "interception_outcome"
## [33] "location.x" "location.y"
## [35] "match_id" "minute"
## [37] "off_camera" "out"
## [39] "pass_aerial_won" "pass_angle"
## [41] "pass_assisted_shot_id" "pass_body_part"
## [43] "pass_cross" "pass_cut_back"
## [45] "pass_deflected" "pass_goal_assist"
## [47] "pass_height" "pass_inswinging"
## [49] "pass_length" "pass_outcome"
## [51] "pass_outswinging" "pass_recipient"
## [53] "pass_recipient_id" "pass_shot_assist"
## [55] "pass_switch" "pass_technique"
## [57] "pass_through_ball" "pass_type"
## [59] "period" "play_pattern.name"
## [61] "player.name" "player.id"
## [63] "position.name" "possession_team.name"
## [65] "possession_team" "possession_team_id"
## [67] "second" "shot.aerial_won"
## [69] "shot.body_part.name" "shot.end_x"
## [71] "shot.end_y" "shot.first_time"
## [73] "shot.key_pass_id" "shot.one_on_one"
## [75] "shot.outcome.name" "shot.statsbomb_xg"
## [77] "shot.technique.name" "shot.type.name"
## [79] "substitution_outcome" "substitution_outcome_id"
## [81] "substitution_replacement" "substitution_replacement_id"
## [83] "tactics" "team.name"
## [85] "team_id" "timestamp"
## [87] "type" "under_pressure"
## [89] "ball_recovery_offensive" "block_offensive"
## [91] "dribble_overrun" "foul_committed_type"
## [93] "miscontrol_aerial_won" "foul_committed_offensive"
## [95] "shot.deflected" "pass_miscommunication"
## [97] "pass_no_touch" "goalkeeper_punched_out"
## [99] "pass_straight" "shot.open_goal"
## [101] "dribble_no_touch" "goalkeeper_shot_saved_off_target"
## [103] "shot_saved_off_target" "goalkeeper_shot_saved_to_post"
## [105] "shot_saved_to_post" "clearance_other"
## [107] "goalkeeper_success_out" "goalkeeper_success_in_play"
## [109] "shot.redirect" "goalkeeper_lost_out"
## [111] "shot.follows_dribble" "half_start_late_video_start"
## [113] "player_off_permanent" "pass_backheel"
## [115] "goalkeeper_lost_in_play" "half_end_early_video_end"
## [117] "goalkeeper_penalty_saved_to_post" "goalkeeper_saved_to_post"
## [119] "match_date" "competition_name"
## [121] "season_name"
head(shots_df[, -c(1,2,3)])
## # A tibble: 6 × 118
## ball_recovery_recover…¹ block_deflection block_save_block clearance_aerial_won
## <lgl> <lgl> <lgl> <lgl>
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## # ℹ abbreviated name: ¹ball_recovery_recovery_failure
## # ℹ 114 more variables: clearance_body_part <lgl>, clearance_head <lgl>,
## # clearance_left_foot <lgl>, clearance_right_foot <lgl>, counterpress <lgl>,
## # dribble_nutmeg <lgl>, dribble_outcome <lgl>, duel_outcome <lgl>,
## # duel_type <lgl>, duration <dbl>, foul_committed_advantage <lgl>,
## # foul_committed_card <lgl>, foul_committed_penalty <lgl>,
## # foul_won_advantage <lgl>, foul_won_defensive <lgl>, …
As we can see, there are in total 180 columns and 70,553 observations. These observations are all seasons for most tournaments but filtered for only shots. The observations for passes on the other hand can be found in the other dataset.
Before we start the analysis, we do some basic data pre processing and also remove some of these redundant 186 columns that we do not require.
shots <- shots_df %>%
dplyr::select(
id,
match_id,
competition_name,
season_name,
timestamp,
minute,
second,
period,
#team.id,
team.name,
possession_team.name,
player.id,
player.name,
#position.id,
position.name,
#play_pattern.id,
play_pattern.name,
under_pressure,
location.x,
location.y,
shot.end_x,
shot.end_y,
#shot.end_location.z,
#shot.end_location_3,
#shot_impact_height,
shot.statsbomb_xg,
#shot.outcome.id,
shot.outcome.name,
#shot.technique.id,
shot.technique.name,
#shot.body_part.id,
shot.body_part.name,
#shot.type.id,
shot.type.name,
shot.aerial_won,
shot.redirect,
shot.follows_dribble,
shot.first_time,
shot.open_goal,
shot.deflected,
shot.redirect,
shot.key_pass_id,
#player.id.GK,
#player.name.GK,
#shot.saved_off_target,
#shot.saved_to_post,
#location.x.GK,
#location.y.GK,
#AngleToGoal,
#AngleToKeeper,
#AngleDeviation
)
passes<- passes_df %>%
dplyr::select(
id,
match_id,
competition_name,
season_name,
timestamp,
minute,
second,
period,
possession_team.name,
team.id,
team.name,
player.id,
player.name,
position.name,
play_pattern.name,
under_pressure,
location.x,
location.y,
pass.length,
pass.angle,
pass.end_x,
pass.end_y,
pass.aerial_won,
pass.switch,
pass.cross,
pass.assisted_shot_id,
pass.shot_assist,
pass.inswinging,
pass.deflected,
pass.outswinging,
pass.through_ball,
pass.cut_back,
pass.goal_assist,
pass.recipient.id,
pass.recipient.name,
pass.height.name,
pass.body_part.name,
pass.type.name,
pass.outcome.name,
pass.technique.name,
ball_receipt.outcome.name,
pass.no_touch
)
To make life easier in the next sections, we calculate the shot distance from player to goal. StatsBomb uses a standard pitch coordinate system where:
We use Euclidean distance as our distance metric.
calculate_shot_distance <- function(x, y) {
sqrt((120 - x)^2 + (40 - y)^2) * (105 / 120)}
shots <- shots %>%
mutate(shot_distance = calculate_shot_distance(location.x, location.y))
We do the same for the distance for the distance of the passes, from player to player.
calculate_pass_distance <- function(x1, y1, x2, y2) {
sqrt((x2 - x1)^2 + (y2 - y1)^2) * (105 / 120)
}
passes <- passes %>%
mutate(pass_distance = calculate_pass_distance(location.x, location.y, pass.end_x, pass.end_y))
Similar to how we calculate the shot distance to the goal, we will calculate the shot angle to the goal.
calculate_angle_to_goal <- function(x, y, goal_x = 120, goal_center = 40, goal_width = 7.32) {
# Convert goal half-width from meters to StatsBomb units
half_width <- (goal_width / 2) / 0.875
left_y <- goal_center - half_width
right_y <- goal_center + half_width
# Calculate the angle (in radians) from the shot to each goalpost
angle_left <- atan((left_y - y) / (goal_x - x))
angle_right <- atan((right_y - y) / (goal_x - x))
# The angle to goal is the difference between the two angles
angle_rad <- angle_right - angle_left
# Convert radians to degrees
angle_deg <- angle_rad * (180 / pi)
return(angle_deg)
}
shots <- shots %>%
mutate(angle_to_goal = calculate_angle_to_goal(location.x, location.y))
The shot.key_pass_id links the unique play id for a shot to the unique play id for the pass leading to the shot (if the shot was assisted). We do not need the unique play id for our purposes, but it will be helpful to know whether a shot was assisted or self-generated. We’ll convert it to a true-false column.
shots <- shots %>%
mutate(shot.is_assisted = !is.na(shot.key_pass_id)) %>% # Create new column based on whether shot.key_pass_id is not null
relocate(shot.is_assisted, .after = shot.key_pass_id) %>% # Move is_assisted right after shot.key_pass_id
dplyr::select(-shot.key_pass_id)
shots %>%
group_by(shot.outcome.name) %>%
summarise(avg_distance = mean(shot_distance, na.rm = TRUE),
count = n()) %>%
arrange(avg_distance)
## # A tibble: 8 × 3
## shot.outcome.name avg_distance count
## <chr> <dbl> <int>
## 1 Goal 11.4 7821
## 2 Post 14.8 1446
## 3 Saved to Post 14.9 218
## 4 Wayward 15.0 3788
## 5 Saved 17.1 16818
## 6 Off T 17.9 22932
## 7 Blocked 18.5 17284
## 8 Saved Off Target 21.0 246
As we can see above, the average distance when scoring a goal is much lower compared to the other shot outcomes such as being blocked or being saved.
passes %>%
mutate(pass.outcome.name = replace_na(pass.outcome.name, "Complete")) %>%
group_by(pass.outcome.name) %>%
summarise(
avg_distance = mean(pass_distance, na.rm = TRUE),
count = n()
) %>%
arrange(avg_distance)
## # A tibble: 6 × 3
## pass.outcome.name avg_distance count
## <chr> <dbl> <int>
## 1 Complete 17.1 154606
## 2 Incomplete 23.6 38095
## 3 Pass Offside 27.7 852
## 4 Unknown 28.4 1419
## 5 Injury Clearance 29.6 325
## 6 Out 33.6 3527
Similarly, the average distance for completed passes is much, much shorter, than passes that are not completed.
Our documentation from StatsBomb states that all completed passes have a null outcome name, so we input the completed pass value in this step as well.
We now create a histogram to look at the shot distance distribution.
hist(shots$shot_distance, breaks = 30, main = "Shot Distance Distribution", xlab = "Distance to Goal (meters)")
Most importantly, the histogram shows how far most shots are taken from the goal. We observe a clear peak at shorter distances, indicating that many shots come from inside or near the penalty box. However, there are also a number of longer-range attempts, suggesting variation in shooting strategy across players or teams.
For data exploration purposes, we create a plot to observe the 10 players for which we have the most shot data.
shots %>%
count(player.name, sort = TRUE) %>%
slice_max(n, n = 10) %>%
ggplot(aes(x = reorder(player.name, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Players by Total Shot Volume", x = "Player", y = "Shots")
As a special initiative for football fans, StatsBomb released large selection of free data for Messi, which partially explains why we have far more shots for him than any other player.
We can do the same for the pass distances.
hist(passes$pass_distance, breaks = 50,
main = "Distribution of Pass Distances",
xlab = "Pass Distance (meters)",
col = "lightblue", border = "white")
Let’s plot this over time:
For convenience, we include a trendline using LOESS (Locally Estimated Scatterplot Smoothing), regression method that captures the underlying trend - without assuming a strictly linear relationship.
avg_shot_distance_by_year <- shots %>%
group_by(season_name) %>%
summarise(
avg_shot_distance = mean(shot_distance, na.rm = TRUE),
shot_count = n()
) %>%
filter(!is.na(season_name))
ggplot(avg_shot_distance_by_year, aes(x = season_name, y = avg_shot_distance, group = 1)) +
geom_line(size = 1.2, color = "darkred") +
geom_point(size = 2, color = "black") +
geom_smooth(method = "loess", se = FALSE, color = "steelblue", linetype = "dashed") +
labs(
title = "Average Shot Distance Over Time",
x = "Season",
y = "Average Distance to Goal (in pitch units)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
This plot shows how most shots are not within the penalty or 11m box from the goal line. Interestingly, we can see how the shot distance started to increase on average from 1970 on. However, the trend seems to have stabilized at around 19 meters, indicating a potential tactical equilibrium — where players still attempt long-range shots, but the majority of efforts come from a more optimal shooting range. It will be interesting to see if we also observe this in our xG model.
The plot for the average distance of passes can also be seen below but it is not as meaningful as we only have data for 2 seasons. We can see a sharp decrease in distance.
avg_pass_distance_by_year <- passes %>%
group_by(season_name) %>%
summarise(
avg_pass_distance = mean(pass_distance, na.rm = TRUE),
pass_count = n()
) %>%
filter(!is.na(season_name))
ggplot(avg_pass_distance_by_year, aes(x = season_name, y = avg_pass_distance, group = 1)) +
geom_line(size = 1.2, color = "darkgreen") +
geom_point(size = 2, color = "black") +
geom_smooth(method = "loess", se = FALSE, color = "steelblue", linetype = "dashed") +
labs(
title = "Average Pass Distance Over Time",
x = "Season",
y = "Average Pass Distance (m)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'
We will observe this more closely by taking a look at the distribution of each year. For this, we create a density plot faceted by season below.
ggplot(shots, aes(x = shot_distance)) +
geom_density(fill = "lightblue") +
facet_wrap(~season_name) + # no `scales = "free_y"`
labs(
title = "Distribution of Shot Distances per Season",
x = "Shot Distance (m)",
y = "Density"
) +
theme_minimal() +
theme(
axis.text.x = element_text(size = 6),
strip.text = element_text(size = 8)
)
However, the trend is more difficult to observe here. We can see how each season has a different pattern/shape and there is no universal rule or distribution that applies to all seasons.
ggplot(passes, aes(x = pass_distance)) +
geom_density(fill = "lightblue") +
facet_wrap(~season_name) +
labs(
title = "Distribution of Pass Distances per Season",
x = "Pass Distance (m)",
y = "Density"
) +
theme_minimal() +
theme(
axis.text.x = element_text(size = 6),
strip.text = element_text(size = 8)
)
For passes on the other hand, it is way more obvious. We can definitely see that short passes are more likely in the 2023/2024 season than in the 2015/2016 season!
We first filtered out penalty shots and ensured all shot coordinates
stayed within the pitch boundaries. Then, we binned shot locations into
10x10 zones and extracted the first year from each
season_name to assign a corresponding decade. We kept only
shots from selected decades (1970s to 2020s). Finally, we grouped the
data by bin and decade, counted the number of shots per bin, and
calculated the proportion of shots (shot_prop) within each
decade to allow normalized comparisons in the heatmap.
library(stringr)
# Ensure shot locations are within pitch boundaries and compute bins + decade
heatmap <- shots %>%
filter(
(shot.type.name != "Penalty" | is.na(shot.type.name))
) %>%
mutate(
location.x = pmax(0, pmin(120, location.x)),
location.y = pmax(0, pmin(80, location.y)),
xbin = cut(location.x, breaks = seq(0, 120, by = 10), include.lowest = TRUE, labels = FALSE),
ybin = cut(location.y, breaks = seq(0, 80, by = 10), include.lowest = TRUE, labels = FALSE),
first_year = as.numeric(str_extract(season_name, "\\d{4}")),
decade = paste0(floor(first_year / 10) * 10, "s")
) %>%
filter(
decade %in% c("1970s", "1980s", "1990s", "2000s", "2010s", "2020s")
)
heatmap_summary <- heatmap %>%
group_by(xbin, ybin, decade) %>%
summarise(
shot_count = n(),
location.x = (xbin - 1) * 10 + 5,
location.y = (ybin - 1) * 10 + 5,
.groups = "drop"
) %>%
group_by(decade) %>%
mutate(
shot_prop = shot_count / sum(shot_count)
) %>%
ungroup()
This function plots shot density heatmaps by decade using normalized shot proportions. It maps shot intensity to a color gradient on a football pitch and generates one plot per selected decade.
plot_heatmap_by_decade <- function(data, decades = c("1970s", "2010s", "2020s")) {
ggplot(data = data %>% filter(decade %in% decades),
aes(x = location.x, y = location.y, fill = shot_prop)) +
geom_tile(width = 10, height = 10, alpha = 0.9, color = "black") +
# Pitch
annotate("rect", xmin = 0, xmax = 120, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
annotate("rect", xmin = 18, xmax = 0, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
annotate("rect", xmin = 102, xmax = 120, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
annotate("segment", x = 60, xend = 60, y = 0, yend = 80, colour = "black", size = 0.6) +
annotate("point", x = 60, y = 40, colour = "black", size = 1.5) +
annotate(
"path",
x = 60 + 10 * cos(seq(0, 2 * pi, length.out = 100)),
y = 40 + 10 * sin(seq(0, 2 * pi, length.out = 100)),
colour = "black", size = 0.6
) +
scale_fill_gradientn(
colours = c("white", "green", "yellow", "red"),
values = scales::rescale(c(0, 0.5, 1)),
name = "Shot %",
labels = scales::label_percent(scale = 1)
) +
scale_y_reverse() +
labs(
title = "Shot Density Map",
subtitle = paste("Decades:", paste(decades, collapse = ", "))
) +
coord_fixed(ratio = 95 / 100) +
facet_wrap(~decade, ncol = 2) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
legend.title = element_blank(),
legend.position = "bottom",
plot.background = element_rect(fill = "white"),
panel.grid = element_blank(),
axis.ticks = element_blank(),
plot.margin = margin(1, 1, 1, 1),
strip.text = element_text(size = 12, face = "bold")
)
}
# Plot only for 1970s and 2020s
plot_heatmap_by_decade(heatmap_summary, decades = c("1970s"))
plot_heatmap_by_decade(heatmap_summary, decades = c("2010s"))
plot_heatmap_by_decade(heatmap_summary, decades = c("2020s"))
Across decades, we observed a clear shift in shot patterns. In the 1970s, shots were mostly central and close to goal. By the 2000s and especially the 2020s, shot locations became more varied, with increased use of wide areas and a continued tendency to shoot from close to the goal, reflecting evolving tactics and more diverse attacking strategies.
Next, we plot a few heatmaps to compare data from the 1970 to 2022 FIFA World Cup to compare shot and pass tends for two distinct tournaments 52 years apart.
We load our 1970 and 2022 FIFA World Cup data.
Join competition data to get seasons (year)
# Filter only shots from StatsBombData
shots_wc <- StatsBombData %>% filter(type.name == "Shot")
passes_wc <- StatsBombData %>% filter(type.name == "Pass")
#Joined competetion to get seasons (for year analysis)
#Joined shots with competions, to obtain seasons
shots_wc <- shots_wc %>% left_join(Comps, by = c("competition_id", "season_id"))
#Joined passes with competions, to obtain seasons
passes_wc <- passes_wc %>%left_join(Comps, by = c("competition_id", "season_id"))
shots_by_season <- shots_wc %>%
group_by(season_name) %>%
summarise(shot_count = n()) %>%
arrange(season_name)
print(shots_by_season)
## # A tibble: 2 × 2
## season_name shot_count
## <chr> <int>
## 1 1970 285
## 2 2022 1494
passes_by_season <- passes_wc %>%
group_by(season_name) %>%
summarise(pass_count = n()) %>%
arrange(season_name)
print(passes_by_season)
## # A tibble: 2 × 2
## season_name pass_count
## <chr> <int>
## 1 1970 5337
## 2 2022 68515
We have much more data available for the 2022 World Cup, which makes sense given the advance of tracking data since 1970.
Shot Density Map in the FIFA World Cup (1970 vs. 2022):
In this analysis, we group shots into a 10x10 meter grid and examine how shooting patterns have evolved in the FIFA World Cup (1970 vs. 2022).
# Ensure shot locations are within pitch boundaries
heatmap <- shots_wc %>%
mutate(
location.x = pmax(0, pmin(120, location.x)),
location.y = pmax(0, pmin(80, location.y))
) %>%
mutate(
xbin = cut(location.x, breaks = seq(0, 120, by = 10), include.lowest = TRUE, labels = FALSE),
ybin = cut(location.y, breaks = seq(0, 80, by = 10), include.lowest = TRUE, labels = FALSE)
)
# Aggregate shots by bin and season
heatmap_summary <- heatmap %>%
group_by(xbin, ybin, season_name) %>%
summarise(
shot_count = n(),
location.x = (xbin - 1) * 10 + 5, # Center each bin
location.y = (ybin - 1) * 10 + 5,
.groups = "drop"
)
This heatmap visualizes shot density across different FIFA World Cup seasons by binning shot locations into a 10x10 meter grid, using a color scale ranging from blue to red to represent shot concentration—red areas indicate high shot frequency—and employing a facet wrap by season to facilitate a direct comparison of shooting patterns across the tournaments.
ggplot(data = heatmap_summary, aes(x = location.x, y = location.y, fill = shot_count)) +
geom_tile(width = 10, height = 10, alpha = 0.9, color = "black") +
# Pitch
annotate("rect", xmin = 0, xmax = 120, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
annotate("rect", xmin = 18, xmax = 0, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
annotate("rect", xmin = 102, xmax = 120, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
annotate("segment", x = 60, xend = 60, y = 0, yend = 80, colour = "black", size = 0.6) +
annotate("point", x = 60, y = 40, colour = "black", size = 1.5) +
annotate(
"path",
x = 60 + 10 * cos(seq(0, 2 * pi, length.out = 100)),
y = 40 + 10 * sin(seq(0, 2 * pi, length.out = 100)),
colour = "black", size = 0.6
) +
scale_fill_gradient(low = "blue", high = "red") +
scale_y_reverse() +
labs(
title = "Shot Density Map 1970-2022",
subtitle = "All Shots – FIFA World Cup"
) +
coord_fixed(ratio = 95 / 100) +
facet_wrap(~season_name, ncol=2) +
theme_minimal()+
theme(
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
legend.title = element_blank(),
legend.position = "bottom", # Move legend below plot
plot.background = element_rect(fill = "white"),
panel.grid = element_blank(),
axis.ticks = element_blank(),
plot.margin = margin(1, 1, 1, 1),
strip.text = element_text(size = 12, face = "bold") # Bigger facet titles
)
The comparison between 1970 and 2022 reveals a notable shift in shot concentration and tactical evolution, supporting our hypothesis: in 1970, shots were more evenly distributed across various areas—including the midfield and wide positions—with long-range attempts near the center circle suggesting an open-play style and reliance on wide attacks, whereas in 2022, there is a marked concentration of shots inside the penalty box, particularly in central positions (the red zone), reflecting modern tactics that prioritize high-percentage finishing and efficient shot locations, as evidenced by the more intense red zones indicating a higher shot frequency in dangerous areas near the goal.
Before moving on to our final question of how the expected goal model can model this trend, we do some additional analysis on it.
In this section, we will analyze how the expected shots, estimated modeled by the data supplier behave in different years, so over time. This will be the first start on understanding if the expected goal model does also put more value onto the short-distance shots. We will model/plot this by
Brazil 1970 vs 2022: The code below filters shot data for the Brazil national team in the 1970 and 2022 FIFA World Cups, excluding penalties. It also defines a color scale for Expected Goals (xG) to visually represent shot quality. The filtered dataset includes shot locations, xG values, body parts used, and season names.
# Set team of interest
selected.team <- "Brazil"
# Define Expected Goals color scale
shotmapxgcolors <- c("#192780", "#2a5d9f", "#40a7d0", "#87cdcf", "#e7f8e6",
"#f4ef95", "#FDE960", "#FCDC5F", "#F5B94D", "#F0983E",
"#ED8A37", "#E66424", "#D54F1B", "#DC2608", "#BF0000",
"#7F0000", "#5F0000")
# Filter shots for selected team (excluding penalties)
shots.brazil <- shots_wc %>%
filter((shot.type.name != "Penalty" | is.na(shot.type.name)) &
team.name == selected.team & season_name %in% c("1970", "2022")) %>%
dplyr::select(location.x, location.y, shot.statsbomb_xg, shot.body_part.name, season_name)
The visualization below plots this data. We do this by shot locations with colors representing their xG values, while different shapes indicate the body part used (head, left foot, or right foot), and a facet wrap by season allows a direct visual comparison between both eras, ultimately highlighting how Brazil’s shot-taking patterns and quality evolved over time.
# Plot
ggplot() +
# Draw pitch
annotate("rect", xmin = 0, xmax = 120, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
annotate("rect", xmin = 18, xmax = 0, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
annotate("rect", xmin = 102, xmax = 120, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
annotate("point", x = 108, y = 40, colour = "black", size = 1.05) + # Penalty spot
annotate("point", x = 60, y = 40, colour = "black", size = 1.05) + # Center spot
# Plot shots
geom_point(data = shots.brazil,
aes(x = location.x, y = location.y, fill = shot.statsbomb_xg, shape = shot.body_part.name),
size = 2, alpha = 0.8) +
# Color and shape scales
scale_fill_gradientn(colours = shotmapxgcolors, limits = c(0, 0.8), oob = scales::squish, name = "Expected Goals") +
scale_shape_manual(values = c("Head" = 21, "Right Foot" = 23, "Left Foot" = 24), name = "Body Part") +
# Facet by season
facet_wrap(~ season_name) +
# Labels and theme
labs(title = paste(selected.team, "Shot Maps"),
subtitle = "FIFA World Cup, Different Years") +
theme_minimal() +
theme(
legend.position = "top",
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.margin = margin(1,1,1,1),
aspect.ratio = 65 / 100
) +
coord_flip(xlim = c(85, 125))
In 1970, Brazil relied more on long-range shooting and varied attempts, whereas in 2022, their shot selection improved by favoring higher xG opportunities inside the penalty box. Interestingly, we can already see in 1970 how the Expected Goal value is higher for close distanced shots. Overall, in this trend we are seeing priority on efficiency over volume.
Now, we’ll create an expected goals model, to assess the likelihood of scoring a goal based on pre-shot information.
We load some relevant shot data to prepare our model.
xg_df <- shots %>%
dplyr::select(shot.outcome.name,
play_pattern.name,
under_pressure,
location.x,
location.y,
shot_distance,
shot.technique.name,
shot.body_part.name,
shot.type.name,
shot.aerial_won,
shot.follows_dribble,
shot.first_time,
shot.open_goal,
shot.is_assisted,
#location.x.GK,
#location.y.GK,
angle_to_goal,
competition_name,
season_name,
team.name,
player.name
) %>%
mutate(is_goal = shot.outcome.name == "Goal") %>% # creates TRUE/FALSE
dplyr::select(is_goal, everything(), -shot.outcome.name) # moves is_goal to the first column
xg_df <- xg_df %>%
mutate(is_shot_foot = if_else(shot.body_part.name %in% c("Right Foot", "Left Foot"), TRUE, FALSE)) %>%
relocate(is_shot_foot, .after = shot.body_part.name) %>%
dplyr::select(-shot.body_part.name)
Next we perform categorical encoding on our columns that contain string values.
insert_one_hot <- function(df, col_name) {
# Create dummy variables (as a data frame)
dummies <- as.data.frame(model.matrix(~ . - 1, data = df[col_name]))
# Clean up column names
colnames(dummies) <- gsub(paste0("^", col_name), col_name, colnames(dummies))
colnames(dummies) <- gsub(" ", ".", colnames(dummies)) # Replace spaces if needed
# Get the original column position
pos <- which(names(df) == col_name)
# Build new df: before, dummies, after
df_new <- bind_cols(
df[1:(pos - 1)],
dummies,
df[(pos + 1):ncol(df)]
)
return(df_new)
}
xg_df_encoded <- xg_df # make a copy to preserve original
cols_to_encode <- c("play_pattern.name", "shot.technique.name", "shot.type.name")
for (col in cols_to_encode) {
xg_df_encoded <- insert_one_hot(xg_df_encoded, col)
}
xg_df_encoded <- xg_df_encoded %>%
mutate(across(where(is.logical), ~ if_else(is.na(.), FALSE, .)))
Now, we’re ready to start fitting our expected goals model. We’ll perform a 70-30 training test split on our dataset.
set.seed(42) # for reproducibility
# Create an 70/30 train-test split based on the is_goal variable
train_index <- createDataPartition(xg_df_encoded$is_goal, p = 0.7, list = FALSE)
train_data <- xg_df_encoded[train_index, ]
test_data <- xg_df_encoded[-train_index, ]
Before we start by adding in some of our descriptive variables, we’ll fit a simple univariate logistic regression to measure the impact of shot distance on goals scored.
# Fit logistic regression on the training set
model_distance <- glm(is_goal ~ shot_distance, data = train_data, family = binomial)
# Summarize the model
summary(model_distance)
##
## Call:
## glm(formula = is_goal ~ shot_distance, family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.021230 0.035402 -0.60 0.549
## shot_distance -0.145218 0.002675 -54.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34404 on 49387 degrees of freedom
## Residual deviance: 30516 on 49386 degrees of freedom
## AIC: 30520
##
## Number of Fisher Scoring iterations: 6
Now, let’s add the shot angle to our model.
# Fit logistic regression on the training set (now adding angle)
model_distance_and_angle <- glm(is_goal ~ shot_distance + angle_to_goal, data = train_data, family = binomial)
# Summarize the model
summary(model_distance_and_angle)
##
## Call:
## glm(formula = is_goal ~ shot_distance + angle_to_goal, family = binomial,
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.272373 0.091234 -13.95 <2e-16 ***
## shot_distance -0.096463 0.004085 -23.62 <2e-16 ***
## angle_to_goal 0.018460 0.001254 14.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34404 on 49387 degrees of freedom
## Residual deviance: 30289 on 49385 degrees of freedom
## AIC: 30295
##
## Number of Fisher Scoring iterations: 6
We evaluate the two models using the AIC and a likelihood ratio test.
AIC(model_distance, model_distance_and_angle)
## df AIC
## model_distance 2 30519.85
## model_distance_and_angle 3 30295.43
anova(model_distance, model_distance_and_angle, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: is_goal ~ shot_distance
## Model 2: is_goal ~ shot_distance + angle_to_goal
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 49386 30516
## 2 49385 30289 1 226.42 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that the AIC of the model decreases slightly when we add the shot angle.
However, when we look at the likelihood ratio test, we see that the the shot angle p-value is highly significant for whether a shot resulted in a goal.
Now, let’s add whether the shot was taken while under pressure from the defense.
# Fit logistic regression on the training set (now adding whether the shot was under pressure)
model_distance_and_angle_and_pressure <- glm(is_goal ~ shot_distance + angle_to_goal + under_pressure, data = train_data, family = binomial)
# Summarize the model
summary(model_distance_and_angle_and_pressure)
##
## Call:
## glm(formula = is_goal ~ shot_distance + angle_to_goal + under_pressure,
## family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.968114 0.092711 -10.44 <2e-16 ***
## shot_distance -0.106106 0.004139 -25.63 <2e-16 ***
## angle_to_goal 0.018797 0.001271 14.79 <2e-16 ***
## under_pressureTRUE -0.766852 0.037896 -20.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34404 on 49387 degrees of freedom
## Residual deviance: 29837 on 49384 degrees of freedom
## AIC: 29845
##
## Number of Fisher Scoring iterations: 6
We evaluate the three models
AIC(model_distance, model_distance_and_angle, model_distance_and_angle_and_pressure)
## df AIC
## model_distance 2 30519.85
## model_distance_and_angle 3 30295.43
## model_distance_and_angle_and_pressure 4 29845.18
anova(model_distance, model_distance_and_angle, model_distance_and_angle_and_pressure, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: is_goal ~ shot_distance
## Model 2: is_goal ~ shot_distance + angle_to_goal
## Model 3: is_goal ~ shot_distance + angle_to_goal + under_pressure
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 49386 30516
## 2 49385 30289 1 226.42 < 2.2e-16 ***
## 3 49384 29837 1 452.25 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that incorporating shot pressure has continued to the lower the AIC, while continuing to be statistically significant.
Now, let’s try adding some more descriptive variables.
We’ll now add: - Whether the shot was taken with the player’s foot (or with a different part of the body) - Whether the player had an open goal (or not) - Whether the shot was assisted (i.e. was this a self-generated chance or was it created by another player)
model_descriptive <- glm(is_goal ~
shot_distance +
angle_to_goal +
under_pressure +
is_shot_foot +
shot.is_assisted +
shot.type.namePenalty +
shot.open_goal
, data = train_data, family = binomial)
# Summarize the model
summary(model_descriptive)
##
## Call:
## glm(formula = is_goal ~ shot_distance + angle_to_goal + under_pressure +
## is_shot_foot + shot.is_assisted + shot.type.namePenalty +
## shot.open_goal, family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.985036 0.121407 -16.350 < 2e-16 ***
## shot_distance -0.117896 0.004461 -26.430 < 2e-16 ***
## angle_to_goal 0.020688 0.001481 13.965 < 2e-16 ***
## under_pressureTRUE -0.300934 0.039746 -7.571 3.69e-14 ***
## is_shot_footTRUE 1.172737 0.049399 23.740 < 2e-16 ***
## shot.is_assistedTRUE -0.060388 0.038090 -1.585 0.113
## shot.type.namePenalty 2.403548 0.092937 25.862 < 2e-16 ***
## shot.open_goalTRUE 1.455947 0.116027 12.548 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34404 on 49387 degrees of freedom
## Residual deviance: 27773 on 49380 degrees of freedom
## AIC: 27789
##
## Number of Fisher Scoring iterations: 6
AIC(model_distance_and_angle_and_pressure, model_descriptive)
## df AIC
## model_distance_and_angle_and_pressure 4 29845.18
## model_descriptive 8 27789.31
anova(model_distance_and_angle_and_pressure, model_descriptive, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: is_goal ~ shot_distance + angle_to_goal + under_pressure
## Model 2: is_goal ~ shot_distance + angle_to_goal + under_pressure + is_shot_foot +
## shot.is_assisted + shot.type.namePenalty + shot.open_goal
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 49384 29837
## 2 49380 27773 4 2063.9 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing the descriptive model to our previous best model, we see that the descriptive metrics have significantly improved our model.
The AIC has lowered, while the deviance has lowered by 1257 while still maintaining statistical significance.
Now that we have a suitable model, we will try to predict goal probabilities on our unseen testing data.
# Predict on the test set (get predicted probabilities)
predicted_probs <- predict(model_descriptive, newdata = test_data, type = "response")
# Create a prediction data frame covering the range of shot distances
pred_data <- data.frame(shot_distance = seq(min(train_data$shot_distance, na.rm = TRUE),
max(train_data$shot_distance, na.rm = TRUE),
length.out = 100))
pred_data$pred_prob <- predict(model_distance, newdata = pred_data, type = "response")
# Plot the relationship between shot distance and predicted probability
ggplot(pred_data, aes(x = shot_distance, y = pred_prob)) +
geom_line(color = "blue") +
labs(title = "Predicted Goal Probability vs. Shot Distance",
x = "Shot Distance from the Goal (Meters)",
y = "Predicted Probability of Scoring") +
theme_minimal()
Our model clearly illustrates that the predicted goal probability decreases exponentially as the shot distance increases.
we can try using the expected goal predictions to compare that to actual goals scored. we’ll use the 2015/16 English Premier League as our first example. We will evaluate actual vs expected goal performance for both teams and players.
We will also load team logos for each of the teams in question to prettify the plot. We collected team image logos for each of the teams we will plot, which we load below.
# List all PNG files in the "logos" folder
logo_files <- list.files("logos", pattern = "\\.png$", full.names = TRUE)
# Extract the team name from each filename by removing the folder path and ".png" extension
# e.g. "logos/Sevilla.png" -> "Sevilla"
team_logos <- data.frame(
logo_path = logo_files,
stringsAsFactors = FALSE
) %>%
mutate(
# Remove the folder path
filename = basename(logo_path),
# Remove ".png" extension
team.name = str_remove(filename, "\\.png$")
) %>%
dplyr::select(team.name, logo_path)
head(team_logos)
## team.name logo_path
## 1 AFC Bournemouth logos/AFC Bournemouth.png
## 2 Arsenal logos/Arsenal.png
## 3 Aston Villa logos/Aston Villa.png
## 4 Athletic Club logos/Athletic Club.png
## 5 Atlético Madrid logos/Atlético Madrid.png
## 6 Barcelona logos/Barcelona.png
Now, we are ready to apply our descriptive model on 2015/16 Premier League teams. We choose 2015/16 because this is the season in which StatsBomb released the most free data.
Note that we add a linear trend line (y = x) to clearly represent the relationship between actual and expected goals. If a team/player’s goal output is above the trend line, then the team/player has scored more goals than expected. If a team/player’s output is below the trend line, then the team/player has scored less goals than expected.
# Filter for EPL 2015/16 and compute predicted xG using model_descriptive
premier_df_team <- xg_df_encoded %>%
filter(competition_name == "England - Premier League", season_name == "2015/2016") %>%
mutate(predicted_xg = predict(model_descriptive, newdata = ., type = "response"))
# Aggregate data at the team level and add a column for actual minus expected goals
team_summary <- premier_df_team %>%
group_by(team.name) %>%
summarise(
actual_goals = sum(as.numeric(is_goal), na.rm = TRUE),
expected_goals = sum(predicted_xg, na.rm = TRUE),
shots = n()
) %>%
mutate(goal_diff = actual_goals - expected_goals)
# Top 10 overperforming teams (largest positive goal_diff)
top_overperformers <- team_summary %>%
arrange(desc(goal_diff)) %>%
slice_head(n = 10)
print("Top 10 Overperformers:")
## [1] "Top 10 Overperformers:"
print(top_overperformers)
## # A tibble: 10 × 5
## team.name actual_goals expected_goals shots goal_diff
## <chr> <dbl> <dbl> <int> <dbl>
## 1 West Ham United 64 61.0 559 3.00
## 2 Tottenham Hotspur 68 65.5 656 2.50
## 3 Manchester City 71 69.3 614 1.75
## 4 Manchester United 46 45.7 431 0.293
## 5 Newcastle United 43 43.3 404 -0.281
## 6 Liverpool 62 62.6 635 -0.581
## 7 Everton 56 56.6 498 -0.583
## 8 Sunderland 44 45.9 451 -1.94
## 9 Southampton 57 59.1 527 -2.09
## 10 Stoke City 40 42.2 425 -2.22
# Top 10 underperforming teams (lowest, i.e. most negative goal_diff)
top_underperformers <- team_summary %>%
arrange(goal_diff) %>%
slice_head(n = 10)
print("Top 10 Underperformers:")
## [1] "Top 10 Underperformers:"
print(top_underperformers)
## # A tibble: 10 × 5
## team.name actual_goals expected_goals shots goal_diff
## <chr> <dbl> <dbl> <int> <dbl>
## 1 West Bromwich Albion 32 46.8 396 -14.8
## 2 Crystal Palace 38 52.2 477 -14.2
## 3 Watford 36 48.4 450 -12.4
## 4 Aston Villa 25 36.7 392 -11.7
## 5 Arsenal 62 72.3 583 -10.3
## 6 Norwich City 38 45.3 424 -7.27
## 7 Swansea City 41 46.7 448 -5.73
## 8 Leicester City 67 71.9 525 -4.86
## 9 AFC Bournemouth 43 47.7 478 -4.71
## 10 Chelsea 55 59.0 535 -3.99
#merge team summary with the team logo dataframe
team_summary_with_logo <- left_join(team_summary, team_logos, by = "team.name")
ggplot(team_summary_with_logo, aes(x = expected_goals, y = actual_goals)) +
geom_image(aes(image = logo_path), size = 0.065) + # Adjust size as needed
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey50") +
labs(
title = "Actual Goals vs Expected Goals (xG) in 2015/16 EPL",
x = "Expected Goals (xG)",
y = "Actual Goals"
) +
theme_minimal(base_size = 12)
We can see that only 3 of the 20 Premier League teams overperformed their expected goal output in 2015/16. Interestingly enough, Arsenal underperformed their expected goal output by a whopping 14 goals. Arsenal finished 2nd, and probably felt that they had a real chance at winning the league had they performed better in front of goal. Due to their struggles in front of goal, they targeted Leicester City striker Jamie Vardy to become their new striker, but ultimately settled for spending €20M on Lucas Perez from Deportivo following Vardy’s rejection.
Now, let’s look at the graph for individual players.
# Filter for Premier League 2015/16 shots and generate predictions
premier_df_player <- xg_df_encoded %>%
filter(competition_name == "England - Premier League", season_name == "2015/2016") %>%
mutate(predicted_xg = predict(model_descriptive, newdata = ., type = "response"))
# Aggregate the data at the player level
player_summary <- premier_df_player %>%
group_by(player.name) %>% # Group by player name
# Normalize the team names for joining
summarise(
team_name = first(team.name),
actual_goals = sum(as.numeric(is_goal), na.rm = TRUE),
expected_goals = sum(predicted_xg, na.rm = TRUE),
shots = n(),
.groups = "drop"
) %>%
mutate(
team_norm = stri_trans_general(team_name, "Latin-ASCII"),
goal_diff = actual_goals - expected_goals) %>%
arrange(desc(actual_goals))
# Top 10 overperforming players (largest positive goal_diff)
top_overperformers_players <- player_summary %>%
arrange(desc(goal_diff)) %>%
slice_head(n = 10)
print("Top 10 Overperforming Players:")
## [1] "Top 10 Overperforming Players:"
print(top_overperformers_players)
## # A tibble: 10 × 7
## player.name team_name actual_goals expected_goals shots team_norm goal_diff
## <chr> <chr> <dbl> <dbl> <int> <chr> <dbl>
## 1 Sergio Leone… Manchest… 24 17.6 118 Manchest… 6.40
## 2 Riyad Mahrez Leiceste… 17 12.7 87 Leiceste… 4.28
## 3 Dimitri Payet West Ham… 9 4.84 69 West Ham… 4.16
## 4 Anthony Mart… Manchest… 11 7.09 57 Manchest… 3.91
## 5 Georginio Wi… Newcastl… 11 7.21 54 Newcastl… 3.79
## 6 Jamie Vardy Leiceste… 24 20.2 118 Leiceste… 3.78
## 7 Kelechi Prom… Manchest… 8 4.34 29 Manchest… 3.66
## 8 Harry Kane Tottenha… 25 21.4 158 Tottenha… 3.64
## 9 Aaron Lennon Everton 5 1.83 18 Everton 3.17
## 10 Roberto Firm… Liverpool 10 6.88 65 Liverpool 3.12
# Top 10 underperforming players (most negative goal_diff)
top_underperformers_players <- player_summary %>%
arrange(goal_diff) %>%
slice_head(n = 10)
print("Top 10 Underperforming Players:")
## [1] "Top 10 Underperforming Players:"
print(top_underperformers_players)
## # A tibble: 10 × 7
## player.name team_name actual_goals expected_goals shots team_norm goal_diff
## <chr> <chr> <dbl> <dbl> <int> <chr> <dbl>
## 1 Cameron Jero… Norwich … 3 8.22 49 Norwich … -5.22
## 2 Shinji Okaza… Leiceste… 5 9.43 44 Leiceste… -4.43
## 3 Wilfried Gue… Manchest… 4 7.93 61 Manchest… -3.93
## 4 Jesús Navas … Manchest… 0 3.12 31 Manchest… -3.12
## 5 Chris Smalli… Manchest… 0 2.97 21 Manchest… -2.97
## 6 Etienne Capo… Watford 0 2.80 35 Watford -2.80
## 7 Oscar dos Sa… Chelsea 3 5.70 51 Chelsea -2.70
## 8 Moussa Sisso… Newcastl… 1 3.68 39 Newcastl… -2.68
## 9 Christian Be… Liverpool 9 11.6 64 Liverpool -2.61
## 10 Aleksandar M… Newcastl… 9 11.6 80 Newcastl… -2.61
# Normalize team names in your team logos lookup table:
team_logos <- team_logos %>%
mutate(team_norm = stri_trans_general(team.name, "Latin-ASCII"))
# Also normalize team names in your team_summary data:
team_summary <- team_summary %>%
mutate(team_norm = stri_trans_general(team.name, "Latin-ASCII"))
# Join with the team logos lookup table using the normalized team name
player_summary_with_logo <- left_join(player_summary, team_logos, by = "team_norm")
#filter out players with 0 goals before plotting
player_summary_with_logo_filtered <- player_summary_with_logo %>%
filter(actual_goals > 1)
ggplot(player_summary_with_logo_filtered, aes(x = expected_goals, y = actual_goals, label = player.name)) +
# Use geom_image() to show the team logo at each point
geom_image(aes(image = logo_path), size = 0.05) + # Adjust size as needed
# Overlay text labels for the players
geom_text_repel(
size = 3,
max.overlaps = 15,
box.padding = 0.35,
point.padding = 0.5
) +
# Add a dashed y = x line
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey50") +
labs(
title = "Actual Goals vs Expected Goals (xG) in 2015/16 EPL",
x = "Expected Goals (xG)",
y = "Actual Goals"
) +
theme_minimal(base_size = 12)
## Warning: ggrepel: 151 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Now, we look at the individual player actual vs expected goal tallies. Interestingly enough, two of the three players who had the greatest overperformance in front of goal were Leicester City’s Riyad Mahrez and Jamie Vardy. Mahrez scored 17 goals from 10.4 expected goals, while Vardy scored 24 goals from 18.6 expected goals. This overperformance in front of goal powered Leicester City to win the Premier League title despite being 5000-1 underdogs at the start of the season. As our results would indicate, this performance was not sustainable, which led to Leicester city almost being relegated in the following season
Now that we have studied the English Premier League in 2015/16, let’s look at the Spanish La Liga in that same 2015/16 season.
# Filter for La Liga 2015/16
la_liga_df_team <- xg_df_encoded %>%
filter(competition_name == "Spain - La Liga", season_name == "2015/2016") %>%
mutate(predicted_xg = predict(model_descriptive, newdata = ., type = "response"))
# Aggregate data at the team level and add a column for actual minus expected goals
team_summary <- la_liga_df_team %>%
group_by(team.name) %>%
summarise(
actual_goals = sum(as.numeric(is_goal), na.rm = TRUE),
expected_goals = sum(predicted_xg, na.rm = TRUE),
shots = n()
) %>%
mutate(goal_diff = actual_goals - expected_goals)
# Top 10 overperforming teams (largest positive goal_diff)
top_overperformers <- team_summary %>%
arrange(desc(goal_diff)) %>%
slice_head(n = 10)
print("Top 10 Overperformers:")
## [1] "Top 10 Overperformers:"
print(top_overperformers)
## # A tibble: 10 × 5
## team.name actual_goals expected_goals shots goal_diff
## <chr> <dbl> <dbl> <int> <dbl>
## 1 Real Madrid 108 81.6 717 26.4
## 2 Barcelona 109 94.1 604 14.9
## 3 Atlético Madrid 62 53.5 483 8.49
## 4 Villarreal 42 35.5 347 6.54
## 5 Granada 45 40.9 434 4.08
## 6 Athletic Club 57 53.8 465 3.20
## 7 Las Palmas 42 39.3 423 2.69
## 8 Celta Vigo 51 49.8 446 1.20
## 9 RC Deportivo La Coruña 41 43.0 457 -2.00
## 10 Rayo Vallecano 52 55.1 501 -3.10
# Top 10 underperforming teams (lowest, i.e. most negative goal_diff)
top_underperformers <- team_summary %>%
arrange(goal_diff) %>%
slice_head(n = 10)
print("Top 10 Underperformers:")
## [1] "Top 10 Underperformers:"
print(top_underperformers)
## # A tibble: 10 × 5
## team.name actual_goals expected_goals shots goal_diff
## <chr> <dbl> <dbl> <int> <dbl>
## 1 Sevilla 50 64.0 472 -14.0
## 2 Málaga 37 48.0 456 -11.0
## 3 Espanyol 38 47.9 399 -9.91
## 4 Getafe 36 45.6 436 -9.58
## 5 Real Betis 34 42.2 405 -8.21
## 6 Eibar 48 55.1 424 -7.12
## 7 Real Sociedad 44 50.7 466 -6.74
## 8 Valencia 43 49.1 410 -6.09
## 9 Levante UD 36 41.5 443 -5.54
## 10 Sporting Gijón 39 43.9 380 -4.94
# Normalize team names in your team logos lookup table:
team_logos <- team_logos %>%
mutate(team_norm = stri_trans_general(team.name, "Latin-ASCII"))
# Also normalize team names in your team_summary data:
team_summary <- team_summary %>%
mutate(team_norm = stri_trans_general(team.name, "Latin-ASCII"))
# Then join on the normalized team names:
team_summary_with_logo <- left_join(team_summary, team_logos, by = "team_norm")
ggplot(team_summary_with_logo, aes(x = expected_goals, y = actual_goals)) +
geom_image(aes(image = logo_path), size = 0.07) + # Adjust size as needed
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey50") +
labs(
title = "Actual Goals vs Expected Goals (xG) in 2015/16 La Liga",
x = "Expected Goals (xG)",
y = "Actual Goals"
) +
theme_minimal(base_size = 12)
Barcelona and Real Madrid were miles ahead of the rest of the league in 2015/16! They each scored the most actual goals, while having the two highest actual vs. expected goal differentials.
Now let’s look at the graph for individual players.
# Filter for La Liga 2015/16 shots and generate predictions
la_liga_df_player <- xg_df_encoded %>%
filter(competition_name == "Spain - La Liga", season_name == "2015/2016") %>%
mutate(predicted_xg = predict(model_descriptive, newdata = ., type = "response"))
# Aggregate the data at the player level
player_summary <- la_liga_df_player %>%
group_by(player.name) %>% # Group by player name
# Normalize the team names for joining
summarise(
team_name = first(team.name),
actual_goals = sum(as.numeric(is_goal), na.rm = TRUE),
expected_goals = sum(predicted_xg, na.rm = TRUE),
shots = n(),
.groups = "drop"
) %>%
mutate(
team_norm = stri_trans_general(team_name, "Latin-ASCII"),
goal_diff = actual_goals - expected_goals) %>%
arrange(desc(actual_goals))
# Top 10 overperforming players (largest positive goal_diff)
top_overperformers_players <- player_summary %>%
arrange(desc(goal_diff)) %>%
slice_head(n = 10)
print("Top 10 Overperforming Players:")
## [1] "Top 10 Overperforming Players:"
print(top_overperformers_players)
## # A tibble: 10 × 7
## player.name team_name actual_goals expected_goals shots team_norm goal_diff
## <chr> <chr> <dbl> <dbl> <int> <chr> <dbl>
## 1 Luis Alberto… Barcelona 40 26.6 139 Barcelona 13.4
## 2 Gareth Frank… Real Mad… 19 8.45 81 Real Mad… 10.6
## 3 Antoine Grie… Atlético… 22 13.6 92 Atletico… 8.38
## 4 Cristiano Ro… Real Mad… 35 28.7 228 Real Mad… 6.31
## 5 Karim Benzema Real Mad… 24 18.3 98 Real Mad… 5.66
## 6 Imanol Agirr… Real Soc… 13 8.65 45 Real Soc… 4.35
## 7 Jozabed Sánc… Rayo Val… 9 4.76 45 Rayo Val… 4.24
## 8 Lucas Pérez … RC Depor… 17 12.8 98 RC Depor… 4.22
## 9 Cédric Bakam… Villarre… 12 7.90 49 Villarre… 4.10
## 10 Iñaki Willia… Athletic… 8 4.02 45 Athletic… 3.98
# Top 10 underperforming players (most negative goal_diff)
top_underperformers_players <- player_summary %>%
arrange(goal_diff) %>%
slice_head(n = 10)
print("Top 10 Underperforming Players:")
## [1] "Top 10 Underperforming Players:"
print(top_underperformers_players)
## # A tibble: 10 × 7
## player.name team_name actual_goals expected_goals shots team_norm goal_diff
## <chr> <chr> <dbl> <dbl> <int> <chr> <dbl>
## 1 Álvaro Vázqu… Getafe 5 9.43 48 Getafe -4.43
## 2 Carlos Alber… Real Soc… 5 8.16 61 Real Soc… -3.16
## 3 Diego Robert… Atlético… 1 3.90 26 Atletico… -2.90
## 4 Pablo Pérez … Sporting… 0 2.81 18 Sporting… -2.81
## 5 Rodrigo More… Valencia 2 4.78 38 Valencia -2.78
## 6 Nordin Amrab… Málaga 0 2.74 28 Malaga -2.74
## 7 David Barral… Granada 0 2.59 17 Granada -2.59
## 8 Roger Martí … Levante … 0 2.51 21 Levante … -2.51
## 9 Grzegorz Kry… Sevilla 0 2.50 22 Sevilla -2.50
## 10 Jorge Andúja… Sevilla 1 3.50 32 Sevilla -2.50
# Normalize team names in your team logos lookup table:
team_logos <- team_logos %>%
mutate(team_norm = stri_trans_general(team.name, "Latin-ASCII"))
# Also normalize team names in your team_summary data:
team_summary <- team_summary %>%
mutate(team_norm = stri_trans_general(team.name, "Latin-ASCII"))
# Join with the team logos lookup table using the normalized team name
player_summary_with_logo <- left_join(player_summary, team_logos, by = "team_norm")
#filter out players with 0 goals before plotting
player_summary_with_logo_filtered <- player_summary_with_logo %>%
filter(actual_goals > 1)
ggplot(player_summary_with_logo_filtered, aes(x = expected_goals, y = actual_goals, label = player.name)) +
# Use geom_image() to show the team logo at each point
geom_image(aes(image = logo_path), size = 0.05) + # Adjust size as needed
# Overlay text labels for the players
geom_text_repel(
size = 3,
max.overlaps = 15,
box.padding = 0.35,
point.padding = 0.5
) +
# Add a dashed y = x line
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey50") +
labs(
title = "Actual Goals vs Expected Goals (xG) in 2015/16 La Liga",
x = "Expected Goals (xG)",
y = "Actual Goals"
) +
theme_minimal(base_size = 12)
## Warning: ggrepel: 137 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
As expected from the team graph, 6 of the 10 players who most outperformed their expected goal tallies play for Real Madrid or Barcelona.
This makes sense, given the purpose of an expected goals model. An expected goals model merely estimates the probability of a goal given the pre-shot context. It does not, however, consider the quality of the player taking the shot. This is why we often see elite players have the largest overperformance relative to their historical expected goals, because they are clinical players who can convert difficult chances at a higher rate than the average player.
This limits the predictive power of simple expected goals models, as they are designed to analyze historical performance and not meant to be predictive. However, there are experimental approaches that use Bayesian methodologies to explore the influence of a particular player or position on predicting goal probabilities.
In this R markdown notebook, we have explored how shooting and passing decisions have evolved over time and created our own expected goals model. We discovered that the game has evolved, with teams taking a lower share of long shots and long passes than we observed 50 years ago. Instead, teams have come to favor retaining possession more, playing a higher share of short passes and spurning speculative shot attempts in favor of working harder to generate higher probability goal-scoring opportunities.
These findings align with the output from our expected goals model. Goal probability decreases exponentially the further back a player elects to shoot from. However, we noticed that the xG model does not account for player-specific finishing quality. This is a known limitation of classical xG models, which treat all players equally regardless of individual skill. Future work could explore these player-specific adjustments! Overall, the reduction in average shot and pass distance over time, particularly the concentration of shots inside the box in recent tournaments has clearly evolved: Football has thus become a much more efficient game.
Lionel Messi Shots in 2022 FIFA World:
For those readers that are keen on learning how Lionel Messi performs in the FIFA, the following section shall analyze this briefly.
To analyze Lionel Messi’s shooting performance in the 2022 FIFA World Cup, we first filter his shots from the dataset.
# Filter shots from Lionel Messi inside the box
shots.messi <- shots_wc %>%
filter(
type.name == "Shot",
player.id == 5503 # Lionel Messi's ID
) %>%
mutate(
goal = ifelse(shot.outcome.name == "Goal", "Goal", "Missed") # Define goal/miss categories
)
The following plot shows the trajectories of Messi’s shots during the tournament, with goals highlighted in red and missed shots in blue.
# Plot pitch with colored shots
create_Pitch() +
geom_segment(
data = shots.messi,
aes(
x = location.x,
y = location.y,
xend = shot.end_location.x,
yend = shot.end_location.y,
color = goal # Color by goal/miss
),
lineend = "round",
size = 0.5,
arrow = arrow(length = unit(0.07, "inches"), ends = "last", type = "open")
) +
# Define colors for shot outcomes
scale_color_manual(values = c("Goal" = "red", "Missed" = "blue")) +
# Labels and formatting
labs(
title = "Lionel Messi",
subtitle = "FIFA World Cup Shots, 2002",
color = "Shot Outcome"
) +
# Adjust field proportions
scale_y_reverse() +
coord_fixed(ratio = 105 / 100)+
theme(
plot.margin = margin(1, 1, 1, 1),
legend.position = "bottom", # Places legend below
legend.title = element_text(face = "bold"), # Bold legend title
legend.key.width = unit(2.5, "cm"), # Adjust legend size
legend.text = element_text(size = 10) # Set legend text size,
)
Messi mostly takes his shots from inside or near the penalty area, rarely shooting from outside the box. This suggests he favors close-range finishing over long-distance attempts, and his goals—highlighted with red arrows—are primarily clustered around the center of the penalty area.